During the COVID-19 pandemic, medical face masks have become increasingly important in combating the rapid spread of the virus. As a result, organizations and governments have begun requiring individuals to wear face masks in numerous places and circumstances, creating an opportunity for software engineers to automate the mask-checking process. In this project, we create a face mask detection system that outputs whether a person is wearing a face mask, and whether the mask is worn properly. Such a system could be of high value and used in numerous ways, such as denying entry to, or enforcing fines on, unmasked individuals.
For our dataset, we merged substantial portions of several datasets into one combined, balanced dataset of 6.03 GB containing more than 20,000 images of correctly masked, incorrectly masked, and unmasked human faces. The datasets we used to form our combined dataset were:
We divided the dataset evenly among the three classes and included images spanning different ethnicities, ages, and genders, in order to ensure balanced training and validation sets.
Our proposed system takes an image of a person as input and outputs one of three classes: the person is wearing a mask correctly, wearing a mask incorrectly, or not wearing a mask at all.
CNN models are widely considered the go-to models for image classification problems. The table below compares different CNN architectures by accuracy, size, and training time.
CNN stands for Convolutional Neural Network, a neural network specialized for processing data with a grid-like shape, such as the 2D pixel matrix of an image. Convolutional layers connect each neuron only to a local 2D neighborhood of the previous layer, and the resulting feature maps are processed through successive layers to produce the output. CNNs are typically used for image detection and classification, as this grid-structured processing makes them well suited to computer vision applications.
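To make the grid-structured processing concrete, the sketch below (pure NumPy, illustrative only; the image values and kernel are made up) shows the core operation of a convolutional layer: a small kernel slides over a 2D input and produces a feature map.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation, as in CNN layers)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output value is a weighted sum over a local 2D patch
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)       # toy 4x4 "image"
edge_kernel = np.array([[1.0, -1.0],
                        [1.0, -1.0]])                   # vertical-edge detector
feature_map = conv2d(image, edge_kernel)                # shape (3, 3)
```

A real CNN learns the kernel weights during training and stacks many such layers, but the local, grid-aligned computation is exactly this.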
The model uses a pre-trained ResNet50 convolutional neural network and applies transfer learning, training the weights of only the last layer. Transfer learning leverages a pre-trained model to give better generalization while saving training time, since fewer epochs are needed. ResNet-50 is a CNN with a depth of 50 layers; it is considered a modern architecture whose residual connections allow for deeper networks. It also uses pooling and is available in the TensorFlow framework. This model achieved 98% accuracy on the Face Mask Detection dataset, a set of 7,000+ images labeled as withmask and withoutmask.
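A minimal Keras sketch of this transfer-learning setup could look like the following; the input size, pooling choice, and optimizer are assumptions for illustration, not details taken from the reference model.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# Pre-trained ResNet50 backbone without its ImageNet classification head
base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze the backbone: only the new head is trained

model = models.Sequential([
    base,
    layers.Dense(2, activation="softmax"),  # withmask / withoutmask
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

With the backbone frozen, only the final dense layer's weights are updated during training, which is what makes the approach fast relative to training the full network.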
All of the above models produce only binary outputs (wearing a mask / not wearing a mask). Our proposed extension is a model that additionally identifies whether a mask is worn properly. This means our tentative model has three possible outputs: mask worn properly, mask worn improperly, and mask not worn.
Since several CNN architectures exist that would give valid, accurate results, we decided to use ensemble learning: we ensemble three CNN architectures in the hope of achieving better predictive performance than any single architecture alone. The three CNN models we intended to use were:
The original reference model was trained on a relatively small, two-class dataset of 174.7 MB. To train our model for better generalization, we use a much larger 6.03 GB dataset that includes all three classes of images.
To prevent overfitting and ensure better generalization, we used early stopping while training our models. This is especially useful since training is iterative: once the validation metric stops improving, further epochs only hurt generalization. In addition, callbacks helped us keep the best-performing weights seen during training.
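As a sketch, with Keras (which ships these callbacks) the setup could look like this; the monitored metrics, patience value, and checkpoint path are illustrative assumptions, not the exact values we used.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Stop training once validation loss has not improved for 5 epochs,
# and roll the model back to the best weights seen so far.
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)

# Independently keep a copy of the best model on disk.
checkpoint = ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                             save_best_only=True)

# model.fit(train_data, validation_data=val_data,
#           epochs=50, callbacks=[early_stop, checkpoint])
```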
To further reduce overfitting, we applied data augmentation, such as random rotation and zoom, to our dataset. This technique acts as a regularizer: it effectively increases the amount of training data by modifying existing inputs or generating new ones.
In the ensemble, we used three CNN architectures that produced similarly high accuracy on our dataset. We judged each model's performance using three metrics: training loss, validation loss, and validation accuracy. The tables below show the performance of each architecture before ensembling. The DenseNet model achieved the highest accuracy of the three, and since the models' outputs are averaged together, the final ensemble is expected to give even better predictive performance. These figures improve on the original architecture, which reached only 98% accuracy, whereas our models exceeded 99%.
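The averaging step itself is simple. As an illustrative sketch (the probability vectors below are made-up numbers, not our models' actual outputs):

```python
import numpy as np

# Hypothetical softmax outputs of the three CNNs for a single image,
# over the classes [worn properly, worn improperly, not worn]
preds = np.array([
    [0.70, 0.20, 0.10],   # model 1
    [0.80, 0.15, 0.05],   # model 2
    [0.60, 0.30, 0.10],   # model 3
])

ensemble = preds.mean(axis=0)   # average the class probabilities
classes = ["mask worn properly", "mask worn improperly", "mask not worn"]
prediction = classes[int(np.argmax(ensemble))]
```

Averaging smooths out cases where one model is confidently wrong but the other two are right, which is the intuition behind the ensemble's improved accuracy.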
In addition, a confusion matrix was computed for each model architecture to track the models' predictions relative to the ground truth. As shown by the figures below, all three models gave very accurate predictions, with only slight variation in their accuracy.
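For reference, a confusion matrix for our three-class problem can be built as follows; the labels in this sketch are made-up examples, not predictions from our trained models.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=3):
    """Rows = ground-truth class, columns = predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels: 0 = worn properly, 1 = worn improperly, 2 = not worn
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()   # correct predictions lie on the diagonal
```

Off-diagonal entries show exactly which classes get confused (here, one improperly-worn mask predicted as no mask), which is more informative than accuracy alone.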
We are very pleased with the high accuracy and performance our model achieved. Over the course of the project, we learned a great deal about the practical implementation of deep learning models and were exposed to issues we would not have encountered otherwise. We had to make frequent changes to the model due to difficulties faced at different milestones; this pushed us to research alternative approaches and come up with more creative solutions.